2C: Character Vector and Factor

Readings

From R Coding Basics: An Introduction to the Basics of Coding in R by Dr. Gaston Sanchez:

Topics

  • Character vector

  • Useful functions for character vector

  • Factor

Character vector

  • Data can be in the form of specific categories that describe a certain characteristic.

  • Categories such as gender, color, nationality, and college major are said to be nominal because they have no natural ordering.

  • In contrast, categories such as education level, customer satisfaction, and Likert scale responses are ordinal because they do have a natural ordering.

  • In R, a character vector is often used to represent nominal categories.

# character vector
c('male', 'female', 'female', 'male', 'female')  
c('Math', 'Data Science', 'Math', 'Computer Science', 'Data Science')  
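
A brief sketch of how such a vector can be inspected once it is stored in a variable. The variable name major is chosen here for illustration only; the functions are all from base R.

```r
# Store the categories in a variable (the name `major` is illustrative)
major <- c('Math', 'Data Science', 'Math', 'Computer Science', 'Data Science')

class(major)         # "character"
is.character(major)  # TRUE
length(major)        # 5 elements
unique(major)        # distinct categories, in order of first appearance
```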

Useful functions for character vector

The nchar() function

  • nchar() counts the number of characters of each element.
v <- c('Data', 'science', 'is', 'fun', 'and', 'challenging')
nchar(v)
[1]  4  7  2  3  3 11
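
It is easy to confuse nchar() with length(). The short sketch below, which reuses the vector v from above, contrasts the two: length() counts elements, while nchar() counts the characters inside each element.

```r
v <- c('Data', 'science', 'is', 'fun', 'and', 'challenging')

length(v)              # 6: number of elements in the vector
nchar(v)               # 4 7 2 3 3 11: characters in each element
sum(nchar(v))          # 30: total characters across all elements
nchar('Data science')  # 12: the space counts as a character
```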

💻 Hands-On

Write R code to count the number of characters in each element of the following vector:

towns <- c('Glassboro', 'Clayton', 'Pitman', 'Deptford')
nchar(towns)
[1] 9 7 6 8

The strsplit() function

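  • strsplit() splits a string into substrings at each occurrence of the split separator (with split = ' ', into words) and returns the result as a list.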
strsplit('Data science is fun and challenging', split = ' ')
[[1]]
[1] "Data"        "science"     "is"          "fun"         "and"        
[6] "challenging"
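
The [[1]] in the output above indicates that strsplit() returns a list, with one element per input string. A minimal sketch of how the pieces might be pulled out; the names words and parts are illustrative.

```r
words <- strsplit('Data science is fun and challenging', split = ' ')
class(words)        # "list"
words[[1]]          # the character vector of six words
length(words[[1]])  # 6

# With several input strings, the list has one element per string
parts <- strsplit(c('a,b', 'c,d,e'), split = ',')
lengths(parts)      # 2 3: number of pieces in each string
unlist(parts)       # "a" "b" "c" "d" "e": flattened to one vector
```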

💻 Hands-On

Write R code to extract the words in the following string:

#> 'Name,Age,Major,GPA,Hobbies'
strsplit('Name,Age,Major,GPA,Hobbies', split = ',')
[[1]]
[1] "Name"    "Age"     "Major"   "GPA"     "Hobbies"

The paste() function

  • paste() concatenates the elements of character vectors, joining them with sep (a single space by default).
paste(towns, 'NJ')
[1] "Glassboro NJ" "Clayton NJ"   "Pitman NJ"    "Deptford NJ" 
paste(towns, 'NJ', sep = ' ')
[1] "Glassboro NJ" "Clayton NJ"   "Pitman NJ"    "Deptford NJ" 
paste(towns, 'NJ', sep = ', ')
[1] "Glassboro, NJ" "Clayton, NJ"   "Pitman, NJ"    "Deptford, NJ" 
paste(towns, 'NJ', sep = ' in ')
[1] "Glassboro in NJ" "Clayton in NJ"   "Pitman in NJ"    "Deptford in NJ" 

💻 Hands-On

Write R code to generate the following output:

#> [1] "Glassboro Township" "Clayton Township"   "Pitman Township"   
#> [4] "Deptford Township"
paste(towns, 'Township', sep = ' ')
[1] "Glassboro Township" "Clayton Township"   "Pitman Township"   
[4] "Deptford Township" 

💻 Hands-On

What does the paste0() function do?

towns <- c('Glassboro', 'Clayton', 'Pitman', 'Deptford')

paste0(towns, 'NJ')
[1] "GlassboroNJ" "ClaytonNJ"   "PitmanNJ"    "DeptfordNJ" 

paste0() is a shortcut for paste(..., sep = ''): it concatenates with no separator.

  • Because paste() is vectorized, it concatenates the corresponding elements of its arguments.
first_names <- c('James', 'Robert', 'Mary')
last_names  <- c('Smith', 'Williams', 'Brown')

paste(first_names, last_names)
[1] "James Smith"     "Robert Williams" "Mary Brown"     

💻 Hands-On

Write R code to generate the following output:

#> [1] "Smith, James"     "Williams, Robert" "Brown, Mary"
paste(last_names, first_names, sep = ', ')
[1] "Smith, James"     "Williams, Robert" "Brown, Mary"     
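
The sep and collapse arguments of paste() are easy to mix up. As a rough sketch: sep goes between the corresponding elements and keeps one result per pair, whereas collapse joins all the results into a single string. The names by_sep, by_collapse, and both are illustrative.

```r
first_names <- c('James', 'Robert', 'Mary')
last_names  <- c('Smith', 'Williams', 'Brown')

# sep: inserted between paired elements; the result has three strings
by_sep <- paste(last_names, first_names, sep = ', ')
# "Smith, James" "Williams, Robert" "Brown, Mary"

# collapse: pairs are joined with the default sep (a space),
# then everything is glued into one string
by_collapse <- paste(last_names, first_names, collapse = ', ')
# "Smith James, Williams Robert, Brown Mary"

# Both together: sep within each pair, collapse between pairs
both <- paste(last_names, first_names, sep = ', ', collapse = '; ')
# "Smith, James; Williams, Robert; Brown, Mary"
```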

💻 Hands-On

Write R code to generate the following output:

#> [1] "Glassboro, NJ 08028" "Clayton, NJ 08312"   "Pitman, NJ 08071"   
#> [4] "Deptford, NJ 08096" 
zip_code <- c('08028', '08312', '08071', '08096')

paste0(towns, ', NJ ', zip_code)
[1] "Glassboro, NJ 08028" "Clayton, NJ 08312"   "Pitman, NJ 08071"   
[4] "Deptford, NJ 08096" 

Factor

  • A factor is a specialized data structure for categorical data; it is especially useful for ordinal categories, whose ordering a character vector cannot record.

  • It behaves like a vector but has additional attributes.

  • A factor displays categories (levels) but stores them internally as integers.

  • A factor is created from a character vector using the factor() function.
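
The points above can be checked with a small sketch, assuming a blood-type vector like the one in the example that follows. The name blood_f is illustrative, and the level order is set explicitly here.

```r
blood <- c('O', 'A', 'AB', 'O', 'O', 'B', 'A', 'B', 'AB')
blood_f <- factor(blood, levels = c('A', 'B', 'AB', 'O'))

class(blood_f)       # "factor"
typeof(blood_f)      # "integer": the underlying storage
levels(blood_f)      # "A" "B" "AB" "O": the categories
as.integer(blood_f)  # 4 1 3 4 4 2 1 2 3: position of each value in levels
attributes(blood_f)  # the extra attributes: levels and class
```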

### Nominal factor

#> [1] O  A  AB O  O  B  A  B  AB
#> Levels: A B AB O

### Ordinal factor

#>  [1] extra hot mild      hot       mild      medium    extra hot mild     
#>  [8] medium    extra hot medium    extra hot mild     
#> Levels: mild < medium < hot < extra hot

💻 Hands-On

Create a nominal factor from the following character vector:

blood <- c('O', 'A', 'AB', 'O', 'O', 'B', 'A', 'B', 'AB')
factor(blood, levels = c('O', 'A', 'B', 'AB'))
[1] O  A  AB O  O  B  A  B  AB
Levels: O A B AB

💻 Hands-On

Create an ordinal factor from the following character vector:

pepper <- c('extra hot', 'mild', 'hot', 'mild', 'medium', 'extra hot', 
            'mild', 'medium', 'extra hot', 'medium', 'extra hot', 'mild')
factor(pepper, levels = c('mild', 'medium', 'hot', 'extra hot'), ordered = TRUE)
 [1] extra hot mild      hot       mild      medium    extra hot mild     
 [8] medium    extra hot medium    extra hot mild     
Levels: mild < medium < hot < extra hot
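
One practical payoff of an ordered factor is that its values can be compared. The sketch below stores the factor in a variable named heat (an illustrative name) and tries a few comparisons.

```r
pepper <- c('extra hot', 'mild', 'hot', 'mild', 'medium', 'extra hot',
            'mild', 'medium', 'extra hot', 'medium', 'extra hot', 'mild')
heat <- factor(pepper, levels = c('mild', 'medium', 'hot', 'extra hot'),
               ordered = TRUE)

heat[1] > heat[2]   # TRUE: 'extra hot' ranks above 'mild'
sum(heat >= 'hot')  # 5: values that are 'hot' or 'extra hot'
max(heat)           # extra hot: the highest level present
table(heat)         # counts reported in level order, not alphabetically
```

These comparisons rely on ordered = TRUE; with an unordered factor, operators such as > return NA with a warning.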

💻 Hands-On

Create an ordinal factor from the following character vector:

shirts <- c('L', 'XL', 'XL', 'L', 'M', 'S', 'S', 'XL', 'M', 'S', 'L', 'S')
factor(shirts, levels = c('S', 'M', 'L', 'XL'),
       labels = c('Small', 'Medium', 'Large', 'Extra Large'), ordered = TRUE)
 [1] Large       Extra Large Extra Large Large       Medium      Small      
 [7] Small       Extra Large Medium      Small       Large       Small      
Levels: Small < Medium < Large < Extra Large
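
A short sketch of how levels and labels interact in factor(): levels must match the values as they appear in the data, while labels only rename them for display. The shortened vector sizes and the name sizes_f are made up for illustration.

```r
sizes <- c('L', 'XL', 'S')

# 'XL' is deliberately left out of levels to show what happens
sizes_f <- factor(sizes, levels = c('S', 'M', 'L'),
                  labels = c('Small', 'Medium', 'Large'))

as.character(sizes_f)  # "Large" NA "Small": unmatched values become NA
is.na(sizes_f)         # FALSE TRUE FALSE
levels(sizes_f)        # "Small" "Medium" "Large": the labels are now the levels
```

A value missing from levels is silently converted to NA, so it is worth checking for a misspelled or omitted level.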